Developed by the APE group, APENet is a new high-speed, low-latency, 3-dimensional interconnect architecture
optimized for PC clusters running LQCD-like numerical applications. The hardware implementation is based
on a single PCI-X 133 MHz network interface card hosting six independent bi-directional channels with a peak
bandwidth of 676 MB/s in each direction. We discuss preliminary benchmark results showing performance
similar to or better than that of high-end commercial network systems.



1. Overview 

The APE research group [1] has traditionally
focused on the design and development of custom
silicon, electronics and software optimized for
Lattice QCD (LQCD).

Recent work in the area of LQCD numerical
applications [2,3,4] has shown an increasing
interest in clusters of commodity PCs. This is
mainly due to two facts: the good sustained
performance of commodity processors on numerical
applications, and the slow emergence of
low-latency, high-bandwidth network interconnects.

This paper describes APENet, a 3D network of 
point-to-point, low-latency, high-bandwidth links 
well suited for medium sized clusters running nu- 
merical applications. 

2. APENet 

APENet is a 3D network of point-to-point links
with toroidal boundary conditions. Each Processing
Element (PE), in our case a cluster node, has
6 full-duplex communication channels (X+, X-,
Y+, Y-, Z+, Z-).

Data are transmitted in packets, which are
routed to the destination PE according to simple,
software-overridable rules. Packet delivery is
always guaranteed: transmission is delayed until
the receiver has enough room in its receive
buffers. No external routing device is necessary:
next-neighbour and longer-distance communications
are obtained efficiently by hopping from PE to PE
until the destination is reached, without
penalties for the in-between PEs.
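
The torus addressing and hop-by-hop delivery described above can be sketched as follows. This is only an illustration: the paper does not state the actual routing rules, so the dimension-ordered rule (resolve X, then Y, then Z, taking the shorter way around each ring) is an assumption, and the channel labels and function names are hypothetical.

```python
# Sketch of neighbour addressing and routing on a 3D torus.
# ASSUMPTION: dimension-ordered routing; the paper only says the rules are
# "simple" and software overridable. All names here are hypothetical.

DIRECTIONS = {
    "X+": (1, 0, 0), "X-": (-1, 0, 0),
    "Y+": (0, 1, 0), "Y-": (0, -1, 0),
    "Z+": (0, 0, 1), "Z-": (0, 0, -1),
}


def step(pe, channel, dims):
    """Return the PE reached from `pe` through `channel`, with wraparound."""
    delta = DIRECTIONS[channel]
    return tuple((c + d) % n for c, d, n in zip(pe, delta, dims))


def neighbours(pe, dims):
    """Map each of the six channels to the neighbouring PE coordinates."""
    return {channel: step(pe, channel, dims) for channel in DIRECTIONS}


def signed_distance(src, dst, n):
    """Signed hop count along one ring of size n, preferring the shorter way."""
    forward = (dst - src) % n
    backward = forward - n  # same destination reached in the minus direction
    return forward if forward <= -backward else backward


def route(src, dst, dims):
    """List the channels a packet would traverse from src to dst."""
    hops = []
    for axis, name in enumerate("XYZ"):
        distance = signed_distance(src[axis], dst[axis], dims[axis])
        sign = "+" if distance > 0 else "-"
        hops.extend([name + sign] * abs(distance))
    return hops


def follow(src, hops, dims):
    """Apply a list of hops and return the PE where the packet ends up."""
    pe = src
    for channel in hops:
        pe = step(pe, channel, dims)
    return pe
```

On a 4x4x4 torus, for instance, `route((0, 0, 0), (3, 1, 2), (4, 4, 4))` takes one hop in X- rather than three in X+, which is the benefit of the toroidal boundary conditions.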

Latency is kept to a minimum thanks to a
lightweight low-level protocol (just two 64-bit
words for the header and the footer) and to the
cut-through architecture of the switching device.
Within 10 clock cycles of the arrival of the
header, the receiving channel starts forwarding
the packet along its path, either toward local
buffers (for packets intended for that very PE)
or toward the proper transmitting channel (for
packets which hop onward).



*Talk given at Lattice 2004 by R.A. 
Figure 1. The APELink card.

2.1. The APELink Card 

The building block of the APENet implementation
is the APELink card, shown in Fig. 1. The
APELink is a PCI-X 133 MHz 64-bit card which
uses an Altera Stratix device, a latest-generation
FPGA, as the network device controller, and
six pairs of serializers/deserializers from National
Semiconductor as physical link interfaces.

Figure 2. The APELink functional blocks diagram.

The APELink card (see Fig. 2) is composed of three
major functional blocks. Each block has its own
clock domain, and all data communications between
these blocks are based on dual-clock FIFOs,
which guarantee the robustness of the hardware
itself. The first block is the PCI-X interface, which
handles the communication with the host PCI-X
bus; the second block, called the crossbar switch,
controls the data flow between the PCI-X channel
and the remote communication links; the third
block implements the six remote bi-directional
communication links.

2.2. The APENet Software 

The main programming interface is a simple
proprietary library (apelib) of C functions,
including synchronous, asynchronous and basic
collective functions. It relies on the APELink driver,
a fully multiprocessor-aware (SMP) Linux device
driver supporting versions 2.4 and 2.6 of the
Linux kernel. We are developing both an MPI
implementation based on LAM-MPI and a network
device driver that allows simple IP protocol
traffic to be routed over APENet.

The apelib is targeted at numerical application
code and includes basic primitives
such as ape_send(), ape_recv() and ape_sndrcv(),
and some collective functions (ape_broadcast(),
ape_global_sum()). The ape_sndrcv() primitive
extracts the best performance from our architecture,
as it asymptotically exercises two channels
at once, increasing the aggregate bandwidth.

3. Benchmarks 

In this section we report some preliminary
low-level benchmark results, obtained on early
APELink prototypes.

Benchmarks were performed on dual Intel
Xeon PCs, with both ServerWorks GC-LE and
Intel E7501 chipsets. The PCs are connected in a
small ring topology. The APELink channel speed
is currently kept at 100 MHz, with a peak
performance of 508 MB/s per link, while the PCI-X
interface runs at 133 MHz.

The benchmark performs a "ping-pong" data
transfer (unidirectional and bi-directional)
between two adjacent PEs. In the unidirectional test,
one PE sends a message to a remote PE, then
blocks on receiving a response. The second PE
receives the full message and sends back the same
amount of data. The half round-trip time, averaged
over a number of iterations, is defined as the
latency, i.e. the message transfer time.

Figure 3. Latency as measured in a ping-pong test
for small packet sizes.

From the same test we have estimated the sus- 
tained bandwidth. The bi-directional test differs 
from the unidirectional one since both PEs send 
data simultaneously using the ape_sndrcv() func- 
tion. 
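
The unidirectional measurement procedure can be sketched as below. This is not the authors' benchmark code: it replaces the two PEs and the APELink with two threads and a local socket pair, purely to show how the half round-trip time and the derived bandwidth are computed. All function names are hypothetical.

```python
# Ping-pong sketch: half round-trip time averaged over many iterations.
# ASSUMPTION: a local socket pair stands in for the link between two PEs,
# so the absolute numbers say nothing about APENet itself.
import socket
import threading
import time


def recv_exact(sock, nbytes):
    """Receive exactly nbytes from a stream socket."""
    chunks = []
    remaining = nbytes
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def echo_peer(sock, nbytes, iterations):
    """Second PE: receive the full message, then send back the same amount."""
    for _ in range(iterations):
        data = recv_exact(sock, nbytes)
        sock.sendall(data)


def ping_pong_latency(nbytes, iterations=200):
    """Return the half round-trip time in seconds, averaged over iterations."""
    first, second = socket.socketpair()
    peer = threading.Thread(target=echo_peer, args=(second, nbytes, iterations))
    peer.start()
    message = bytes(nbytes)
    start = time.perf_counter()
    for _ in range(iterations):
        first.sendall(message)       # send the message ...
        recv_exact(first, nbytes)    # ... then block on the response
    elapsed = time.perf_counter() - start
    peer.join()
    first.close()
    second.close()
    return elapsed / (2 * iterations)


def sustained_bandwidth(nbytes, latency_s):
    """Bandwidth estimate in bytes per second from the same test."""
    return nbytes / latency_s


if __name__ == "__main__":
    for size in (16, 256, 4096, 16384):
        latency = ping_pong_latency(size)
        rate = sustained_bandwidth(size, latency) / 1e6
        print(f"{size:6d} bytes: {latency * 1e6:8.2f} us, {rate:8.2f} MB/s")
```

Dividing the elapsed time by twice the number of iterations is what turns the measured round trips into a one-way message transfer time.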
Figure 4. Bandwidth is measured for both
unidirectional and bi-directional tests. The zero-copy
label refers to the use of optimized memory buffers.

In Fig. 3 we plot the latency for message sizes
ranging from 16 bytes to 16 KB. The smallest
message size is 16 bytes, as the minimum packet
payload is a 128-bit word. The estimated latency
is ~6 µs and is constant up to a message size of
256 bytes. For 4096-byte messages we measure
20 µs, which is comparable to commercial
interconnects [5].

Fig. 4 shows the bandwidth plot with message
sizes ranging from 16 bytes to 1 MB. The
bi-directional zero-copy bandwidth saturates at 677
MB/s. At 1 MB message size the unidirectional
bandwidth is 470 MB/s, roughly 90% of the channel
peak performance at 100 MHz. The plot shows
two pairs of curves: those marked zero-copy refer
to the use of pinned-down memory, suitable
for PCI DMA transfers. This way the
overhead of expensive memory copy operations
to/from DMA memory buffers is avoided. Non
zero-copy data are reported only to simplify the
discussion.
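
The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below assumes that the peak link bandwidth scales linearly with the channel clock and that 1 MB means 10^6 bytes. Neither assumption is stated in the text, and the function names are hypothetical.

```python
# Cross-check of the quoted bandwidth figures.
# ASSUMPTIONS: peak bandwidth is proportional to the channel clock, and
# 1 MB = 10^6 bytes.


def scaled_peak_mb_s(clock_mhz, ref_clock_mhz=100.0, ref_peak_mb_s=508.0):
    """Peak link bandwidth at clock_mhz, scaled from the 100 MHz figure."""
    return ref_peak_mb_s * clock_mhz / ref_clock_mhz


def efficiency(sustained_mb_s, peak_mb_s):
    """Fraction of the peak bandwidth actually sustained."""
    return sustained_mb_s / peak_mb_s


def effective_mb_s(message_bytes, transfer_time_us):
    """Effective rate of one transfer; bytes per microsecond equals MB/s."""
    return message_bytes / transfer_time_us


if __name__ == "__main__":
    print(f"peak at 133 MHz: {scaled_peak_mb_s(133.0):.1f} MB/s")
    print(f"unidirectional efficiency: {efficiency(470.0, 508.0):.1%}")
    print(f"4096 bytes in 20 us: {effective_mb_s(4096, 20.0):.1f} MB/s")
```

Under these assumptions the 508 MB/s measured at 100 MHz corresponds to about 676 MB/s at 133 MHz, in line with the peak figure given for the card, and 470 MB/s out of 508 MB/s is about 92.5%, consistent with the "roughly 90%" quoted above.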

4. Conclusions 

The hardware design of the APELink card is
complete, and we are running tests on the final
release of the board, whose link channels
run at full speed (133 MHz). Preliminary benchmarks
have shown encouraging results, comparable
with commercial network interconnects. The
APELink software is progressing rapidly:
current activities focus on a better low-level driver
and an MPI implementation.

The INFN prototype APENet PC cluster, composed
of 16 PCs equipped with APELink boards,
is ready to be used on LQCD test codes. We
plan to expand it to 64 PCs (4^3 topology) in
the near future.

REFERENCES 

1. The APE group, Istituto Nazionale di Fisica
Nucleare,
http://apegate.roma1.infn.it/APE

2. M. Lüscher, Nucl. Phys. Proc. Suppl. 106, 21
(2002) [arXiv:hep-lat/0110007].

3. Z. Fodor and S. D. Katz, JHEP 0203, 014
(2002) [arXiv:hep-lat/0106002].

4. T. Lippert, Nucl. Phys. Proc. Suppl. 129, 88
(2004) [arXiv:hep-lat/0311011].

5. J. Liu et al., Performance Comparison of MPI
Implementations over InfiniBand, Myrinet
and Quadrics, SuperComputing Conference,
November 2003.



